Abstract

For  many  types  of  learners  one  can  compute  the  statistically  “optimal”  way  to  select  data.  We  review  how 
these  techniques  have  been  used  with  feedforward  neural  networks  [MacKay,  1992;  Cohn,  1994].  We  then 
show  how  the  same  principles  may  be  used  to  select  data  for  two  alternative,  statistically-based  learning 
architectures:  mixtures  of  Gaussians  and  locally  weighted  regression.  While  the  techniques  for  neural 
networks  are  expensive  and  approximate,  the  techniques  for  mixtures  of  Gaussians  and  locally  weighted 
regression  are  both  efficient  and  accurate. 
1  ACTIVE  LEARNING  -
BACKGROUND 

An  active  learning  problem  is  one  where  the  learner  has 
the  ability  or  need  to  influence  or  select  its  own  training 
data.  Many  problems  of  great  practical  interest  allow 
active  learning,  and  many  even  require  it. 

We consider the problem of actively learning a mapping
$X \rightarrow Y$ based on a set of training examples
$\{(x_i, y_i)\}_{i=1}^m$, where $x_i \in X$ and $y_i \in Y$.
The learner is allowed to iteratively select new inputs
$\tilde{x}$ (possibly from a constrained set), observe the
resulting output $\tilde{y}$, and incorporate the new examples
$(\tilde{x}, \tilde{y})$ into its training set.

The primary question of active learning is how to choose
which $\tilde{x}$ to try next. There are many heuristics for
choosing $\tilde{x}$ based on intuition, including choosing
places where we don’t have data [Whitehead, 1991], where we
perform poorly [Linden and Weber, 1993], where we have low
confidence [Thrun and Möller, 1992], where we expect it to
change our model [Cohn et al., 1990], and where we previously
found data that resulted in learning [Schmidhuber and Storck,
1993].

In this paper we consider how one may select $\tilde{x}$
“optimally” from a statistical viewpoint. We first review how
the statistical approach can be applied to neural networks, as
described in MacKay [1992] and Cohn [1994]. We then consider
two alternative, statistically-based learning architectures:
mixtures of Gaussians and locally weighted regression. While
optimal data selection for a neural network is computationally
expensive and approximate, we find that optimal data selection
for the two statistical models is efficient and accurate.


2  ACTIVE  LEARNING  -  A
STATISTICAL  APPROACH

We denote the learner’s output given input $x$ as $\hat{y}(x)$.
The mean squared error of this output can be expressed as the
sum of the learner’s squared bias and its variance. The variance
$\sigma^2_{\hat{y}}(x)$ indicates the learner’s uncertainty in
its estimate at $x$.¹ Our goal will be to select a new example
$\tilde{x}$ such that when the resulting example
$(\tilde{x}, \tilde{y})$ is added to the training set, the
integrated variance IV is minimized:

$$\mathrm{IV} = \int \sigma^2_{\hat{y}} \, P(x) \, dx. \tag{1}$$

Here, $P(x)$ is the (known) distribution over $X$. In practice,
we will compute a Monte Carlo approximation of this integral,
evaluating $\sigma^2_{\hat{y}}$ at a number of random points
drawn according to $P(x)$.

Selecting $\tilde{x}$ so as to minimize IV requires computing
$\tilde{\sigma}^2_{\hat{y}}$, the new variance at $x$ given
$(\tilde{x}, \tilde{y})$. Until we actually commit to an
$\tilde{x}$, we do not know what corresponding $\tilde{y}$ we
will see, so the minimization cannot be performed
deterministically.² Many learning architectures, however,
provide an estimate of $P(\tilde{y}|\tilde{x})$ based on current
data, so we can use this estimate to compute the expectation of
$\tilde{\sigma}^2_{\hat{y}}$. Selecting $\tilde{x}$ to minimize
the expected integrated variance provides a solid statistical
basis for choosing new examples.

¹ Unless explicitly denoted, $\hat{y}$ and $\sigma^2_{\hat{y}}$
are functions of $x$. For simplicity, we present our results in
the univariate setting. All results in the paper extend easily
to the multivariate case.

² This contrasts with related work by Plutowski and White
[1993], which is concerned with filtering an existing data set.

2.1  EXAMPLE:  ACTIVE  LEARNING  WITH
A  NEURAL  NETWORK

In this section we review the use of techniques from Optimal
Experiment Design (OED) to minimize the estimated variance of a
neural network [Fedorov, 1972; MacKay, 1992; Cohn, 1994]. We
will assume we have been given a learner
$\hat{y} = f_{\hat{w}}()$, a training set
$\{(x_i, y_i)\}_{i=1}^m$ and a parameter vector $\hat{w}$ that
maximizes a likelihood measure. One such measure is the minimum
sum squared residual

$$S^2 = \frac{1}{m} \sum_{i=1}^m \left( y_i - \hat{y}(x_i) \right)^2.$$

The estimated output variance of the network is

$$\sigma^2_{\hat{y}} \approx \left( \frac{\partial \hat{y}(x)}{\partial w} \right)^T \left( \frac{\partial^2 S^2}{\partial w^2} \right)^{-1} \left( \frac{\partial \hat{y}(x)}{\partial w} \right).$$

The standard OED approach assumes normality and local
linearity. These assumptions allow replacing the distribution
$P(\tilde{y}|\tilde{x})$ by its estimated mean
$\hat{y}(\tilde{x})$ and variance $S^2$. The expected value of
the new variance, $\tilde{\sigma}^2_{\hat{y}}$, is then
[MacKay, 1992]:

$$\left\langle \tilde{\sigma}^2_{\hat{y}} \right\rangle \approx \sigma^2_{\hat{y}} - \frac{\sigma^2_{\hat{y}}(x, \tilde{x})}{S^2 + \sigma^2_{\hat{y}}(\tilde{x})}, \tag{2}$$

where we define

$$\sigma_{\hat{y}}(x, \tilde{x}) \equiv \left( \frac{\partial \hat{y}(x)}{\partial w} \right)^T \left( \frac{\partial^2 S^2}{\partial w^2} \right)^{-1} \left( \frac{\partial \hat{y}(\tilde{x})}{\partial w} \right).$$

For empirical results on the predictive power of Equation 2,
see Cohn [1994].
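
As a rough illustration of Equation 2, the sketch below replaces the neural network with a toy learner that is linear in its two parameters, so the local-linearity assumption holds exactly and the predicted variance can be checked against the variance obtained by actually adding the query. The model, the data points, the noise variance, and the scaling of the information matrix (which stands in for the second-derivative term of the text) are all assumptions made for this example and are not taken from the paper.

```python
def grad(x):
    """Gradient of the toy learner y_hat(x) = w0 + w1 * x w.r.t. (w0, w1)."""
    return [1.0, x]


def inv2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def quad(u, m, v):
    """Return u^T M v for 2-vectors u, v and a 2x2 matrix M."""
    return sum(u[i] * m[i][j] * v[j] for i in range(2) for j in range(2))


def information(xs, noise_var):
    """Information matrix sum_i g(x_i) g(x_i)^T / S^2 (an assumed scaling)."""
    info = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        g = grad(x)
        for i in range(2):
            for j in range(2):
                info[i][j] += g[i] * g[j] / noise_var
    return info


def expected_new_variance(x, x_query, cov, noise_var):
    """Equation 2: sigma^2(x) - sigma(x, x~)^2 / (S^2 + sigma^2(x~))."""
    var_x = quad(grad(x), cov, grad(x))
    var_q = quad(grad(x_query), cov, grad(x_query))
    cross = quad(grad(x), cov, grad(x_query))
    return var_x - cross ** 2 / (noise_var + var_q)


xs = [0.1, 0.2, 0.4, 0.5]   # inputs already in the training set (made up)
noise_var = 0.25            # S^2, assumed known for this example
cov = inv2(information(xs, noise_var))

x_ref, x_query = 0.9, 1.0
current = quad(grad(x_ref), cov, grad(x_ref))
predicted = expected_new_variance(x_ref, x_query, cov, noise_var)

# Variance at x_ref after actually adding x_query, for comparison.
cov_after = inv2(information(xs + [x_query], noise_var))
exact = quad(grad(x_ref), cov_after, grad(x_ref))
print(current, predicted, exact)
```

For this linear toy model the predicted and exact values agree up to rounding. For a real network the agreement is only approximate, because the gradients and curvature are evaluated at the current weights.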

The advantages of minimizing this criterion are that it is
grounded in statistics, and is optimal given the assumptions.
Furthermore, the criterion is continuous and differentiable.
As such, it is applicable in continuous domains with continuous
action spaces, and allows hillclimbing to find the “best”
$\tilde{x}$.

For neural networks, however, this approach has many
disadvantages. The criterion relies on simplifications and
strong assumptions which hold only approximately. Computing
the variance estimate requires inversion of a
$|w| \times |w|$ matrix for each new example, and incorporating
new examples into the network requires expensive retraining.
Paass and Kindermann [1995] discuss an approach which addresses
some of these problems.


3  MIXTURES  OF  GAUSSIANS 

The mixture of Gaussians model is gaining popularity among
machine learning practitioners [Nowlan, 1991; Specht, 1991;
Ghahramani and Jordan, 1994]. It assumes that the data is
produced by a mixture of $N$ Gaussians $g_i$, for
$i = 1, \ldots, N$. We can use the EM algorithm [Dempster et
al., 1977] to find the best fit to the data, after which the
conditional expectations of the mixture can be used for
function approximation.

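For each Gaussian $g_i$ we will denote the estimated
input/output means as $\mu_{x,i}$ and $\mu_{y,i}$ and estimated
covariances as $\sigma^2_{x,i}$, $\sigma^2_{y,i}$ and
$\sigma_{xy,i}$. The conditional variance of $y$ given $x$ may
then be written

$$\sigma^2_{y|x,i} = \sigma^2_{y,i} - \frac{\sigma^2_{xy,i}}{\sigma^2_{x,i}}.$$

We will denote as $n_i$ the (possibly fractional) number of
training examples for which $g_i$ takes responsibility:

$$n_i = \sum_{j=1}^m \frac{P(x_j, y_j | i)}{\sum_{k=1}^N P(x_j, y_j | k)}.$$

For an input $x$, each $g_i$ has conditional expectation
$\hat{y}_i$ and variance $\sigma^2_{\hat{y},i}$:

$$\hat{y}_i = \mu_{y,i} + \frac{\sigma_{xy,i}}{\sigma^2_{x,i}} (x - \mu_{x,i}), \qquad \sigma^2_{\hat{y},i} = \frac{\sigma^2_{y|x,i}}{n_i} \left( 1 + \frac{(x - \mu_{x,i})^2}{\sigma^2_{x,i}} \right).$$

These expectations and variances are mixed according to the
prior probability that $g_i$ has of being responsible for $x$:

$$h_i \equiv h_i(x) = \frac{P(x | i)}{\sum_{j=1}^N P(x | j)}.$$

For input $x$ then, the conditional expectation $\hat{y}$ of
the resulting mixture and its variance may be written:

$$\hat{y} = \sum_{i=1}^N h_i \hat{y}_i, \qquad \sigma^2_{\hat{y}} = \sum_{i=1}^N \frac{h_i^2 \sigma^2_{y|x,i}}{n_i} \left( 1 + \frac{(x - \mu_{x,i})^2}{\sigma^2_{x,i}} \right).$$

In contrast to the variance estimate computed for a neural
network, here $\sigma^2_{\hat{y}}$ can be computed efficiently
with no approximations.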
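
A minimal sketch of this computation is given below, assuming the mixture has already been fitted (for instance by EM) and that each Gaussian is summarized by its responsibility count, means, variances and covariance. The `Component` class, the function names and the numerical values are hypothetical and serve only to illustrate the mixing of per-component expectations and variances.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    n: float       # (possibly fractional) number of examples g_i is responsible for
    mu_x: float    # estimated input mean
    mu_y: float    # estimated output mean
    var_x: float   # estimated input variance
    var_y: float   # estimated output variance
    cov_xy: float  # estimated input/output covariance

    def cond_var(self):
        """Conditional variance of y given x for this Gaussian."""
        return self.var_y - self.cov_xy ** 2 / self.var_x


def density(x, mu, var):
    """Univariate Gaussian density, used here as P(x | i)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, comps):
    """h_i(x) = P(x | i) / sum_j P(x | j)."""
    p = [density(x, c.mu_x, c.var_x) for c in comps]
    total = sum(p)
    return [pi / total for pi in p]


def predict(x, comps):
    """Return the mixture's conditional expectation and its variance at x."""
    y_hat, var_hat = 0.0, 0.0
    for h, c in zip(responsibilities(x, comps), comps):
        y_i = c.mu_y + c.cov_xy / c.var_x * (x - c.mu_x)
        var_i = c.cond_var() / c.n * (1.0 + (x - c.mu_x) ** 2 / c.var_x)
        y_hat += h * y_i
        var_hat += h ** 2 * var_i
    return y_hat, var_hat


# Two made-up components standing in for a fitted mixture.
comps = [
    Component(n=12.0, mu_x=0.2, mu_y=0.8, var_x=0.02, var_y=0.05, cov_xy=0.02),
    Component(n=8.0, mu_x=0.7, mu_y=-0.5, var_x=0.03, var_y=0.06, cov_xy=-0.03),
]
print(predict(0.3, comps))
```

Each call costs time linear in the number of Gaussians and involves no matrix inversion in the univariate setting.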

3.1  ACTIVE  LEARNING  WITH  A 
MIXTURE  OF  GAUSSIANS 

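We want to select $\tilde{x}$ to minimize
$\langle \tilde{\sigma}^2_{\hat{y}} \rangle$. With a mixture of
Gaussians, the model’s estimated distribution of $\tilde{y}$
given $\tilde{x}$ is explicit:

$$P(\tilde{y} | \tilde{x}) = \sum_{i=1}^N \tilde{h}_i P(\tilde{y} | \tilde{x}, i) = \sum_{i=1}^N \tilde{h}_i \, \mathcal{N}\left( \hat{y}_i(\tilde{x}), \sigma^2_{y|x,i} \right),$$

where $\tilde{h}_i \equiv h_i(\tilde{x})$. Given this,
calculation of $\langle \tilde{\sigma}^2_{\hat{y}} \rangle$ is
straightforward: we model the change in each $g_i$ separately,
calculating its expected variance given a new point sampled
from $P(\tilde{y} | \tilde{x}, i)$ and weighting this change by
$\tilde{h}_i$. The new expectations combine to form the
learner’s new expected variance

$$\left\langle \tilde{\sigma}^2_{\hat{y}} \right\rangle = \sum_{i=1}^N \frac{h_i^2 \left\langle \tilde{\sigma}^2_{y|x,i} \right\rangle}{n_i + \tilde{h}_i} \left( 1 + \frac{(x - \tilde{\mu}_{x,i})^2}{\left\langle \tilde{\sigma}^2_{x,i} \right\rangle} \right), \tag{3}$$

where the expectation can be computed exactly in closed form:

$$\tilde{\mu}_{x,i} = \frac{n_i \mu_{x,i} + \tilde{h}_i \tilde{x}}{n_i + \tilde{h}_i},$$

$$\left\langle \tilde{\sigma}^2_{x,i} \right\rangle = \frac{n_i \sigma^2_{x,i}}{n_i + \tilde{h}_i} + \frac{n_i \tilde{h}_i (\tilde{x} - \mu_{x,i})^2}{(n_i + \tilde{h}_i)^2},$$

$$\left\langle \tilde{\sigma}^2_{y,i} \right\rangle = \frac{n_i \sigma^2_{y,i} + \tilde{h}_i \sigma^2_{y|x,i}}{n_i + \tilde{h}_i} + \frac{n_i \tilde{h}_i (\hat{y}_i(\tilde{x}) - \mu_{y,i})^2}{(n_i + \tilde{h}_i)^2},$$

$$\left\langle \tilde{\sigma}_{xy,i} \right\rangle = \frac{n_i \sigma_{xy,i}}{n_i + \tilde{h}_i} + \frac{n_i \tilde{h}_i (\tilde{x} - \mu_{x,i})(\hat{y}_i(\tilde{x}) - \mu_{y,i})}{(n_i + \tilde{h}_i)^2},$$

$$\left\langle \tilde{\sigma}^2_{xy,i} \right\rangle = \left\langle \tilde{\sigma}_{xy,i} \right\rangle^2 + \frac{n_i^2 \tilde{h}_i^2 \sigma^2_{y|x,i} (\tilde{x} - \mu_{x,i})^2}{(n_i + \tilde{h}_i)^4},$$

$$\left\langle \tilde{\sigma}^2_{y|x,i} \right\rangle = \left\langle \tilde{\sigma}^2_{y,i} \right\rangle - \frac{\left\langle \tilde{\sigma}^2_{xy,i} \right\rangle}{\left\langle \tilde{\sigma}^2_{x,i} \right\rangle}.$$


4  LOCALLY  WEIGHTED
REGRESSION

We consider here two forms of locally weighted regression
(LWR): kernel regression and the LOESS model [Cleveland et
al., 1988]. Kernel regression computes $\hat{y}$ as an average
of the $y_i$ in the data set, weighted by a kernel centered at
$x$. The LOESS model performs a linear regression on points in
the data set, weighted by a kernel centered at $x$. The kernel
shape is a design parameter: the original LOESS model uses a
“tricubic” kernel; in our experiments we use the more common
Gaussian

$$h_i(x) \equiv h(x - x_i) = \exp\left( -k (x - x_i)^2 \right),$$

where $k$ is a smoothing constant. For brevity, we will drop
the argument $x$ for $h_i(x)$, and define $n = \sum_i h_i$.
We can then write the estimated means and covariances as

$$\mu_x = \frac{\sum_i h_i x_i}{n}, \qquad \sigma^2_x = \frac{\sum_i h_i (x_i - \mu_x)^2}{n}, \qquad \sigma_{xy} = \frac{\sum_i h_i (x_i - \mu_x)(y_i - \mu_y)}{n},$$

$$\mu_y = \frac{\sum_i h_i y_i}{n}, \qquad \sigma^2_y = \frac{\sum_i h_i (y_i - \mu_y)^2}{n}, \qquad \sigma^2_{y|x} = \sigma^2_y - \frac{\sigma^2_{xy}}{\sigma^2_x}.$$

We use them to express the conditional expectations and their
estimated variances:

$$\text{kernel:} \quad \hat{y} = \mu_y, \qquad \sigma^2_{\hat{y}} = \frac{\sigma^2_y}{n},$$

$$\text{LOESS:} \quad \hat{y} = \mu_y + \frac{\sigma_{xy}}{\sigma^2_x} (x - \mu_x), \qquad \sigma^2_{\hat{y}} = \frac{\sigma^2_{y|x}}{n} \left( 1 + \frac{(x - \mu_x)^2}{\sigma^2_x} \right).$$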
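
The two locally weighted estimators can be sketched as follows, assuming a Gaussian kernel with smoothing constant `k` and weighted moments taken about the local means. The function names and the small data set are hypothetical; the sketch is meant only to show how the kernel-weighted statistics feed the kernel regression and LOESS estimates and their variances.

```python
import math


def local_stats(x, xs, ys, k):
    """Kernel-weighted statistics of the data around the query point x."""
    h = [math.exp(-k * (x - xi) ** 2) for xi in xs]
    n = sum(h)
    mu_x = sum(hi * xi for hi, xi in zip(h, xs)) / n
    mu_y = sum(hi * yi for hi, yi in zip(h, ys)) / n
    var_x = sum(hi * (xi - mu_x) ** 2 for hi, xi in zip(h, xs)) / n
    var_y = sum(hi * (yi - mu_y) ** 2 for hi, yi in zip(h, ys)) / n
    cov_xy = sum(hi * (xi - mu_x) * (yi - mu_y)
                 for hi, xi, yi in zip(h, xs, ys)) / n
    return n, mu_x, mu_y, var_x, var_y, cov_xy


def kernel_regression(x, xs, ys, k):
    """Weighted average of the outputs and its estimated variance."""
    n, _, mu_y, _, var_y, _ = local_stats(x, xs, ys, k)
    return mu_y, var_y / n


def loess(x, xs, ys, k):
    """Locally weighted linear fit at x and its estimated variance."""
    n, mu_x, mu_y, var_x, var_y, cov_xy = local_stats(x, xs, ys, k)
    cond_var = var_y - cov_xy ** 2 / var_x
    y_hat = mu_y + cov_xy / var_x * (x - mu_x)
    var_hat = cond_var / n * (1.0 + (x - mu_x) ** 2 / var_x)
    return y_hat, var_hat


# Noise-free points on the line y = 2x + 1 (made-up data).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [2.0 * xi + 1.0 for xi in xs]
print(kernel_regression(0.3, xs, ys, k=10.0))
print(loess(0.3, xs, ys, k=10.0))
```

On exactly linear data the LOESS estimate reproduces the line and its conditional variance vanishes up to rounding, while kernel regression returns a smoothed local average.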


4.1  ACTIVE  LEARNING  WITH  LOCALLY 
WEIGHTED  REGRESSION 

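Again we want to select $\tilde{x}$ to minimize
$\langle \tilde{\sigma}^2_{\hat{y}} \rangle$. With LWR, the
model’s estimated distribution of $\tilde{y}$ given $\tilde{x}$
is explicit:

$$P(\tilde{y} | \tilde{x}) = \mathcal{N}\left( \hat{y}(\tilde{x}), \sigma^2_{y|x}(\tilde{x}) \right).$$

The estimate of $\langle \tilde{\sigma}^2_{\hat{y}} \rangle$ is
also explicit. Defining $\tilde{h}$ as the weight assigned to
$\tilde{x}$ by the kernel, the learner’s expected new variance
is

$$\text{kernel:} \quad \left\langle \tilde{\sigma}^2_{\hat{y}} \right\rangle = \frac{\left\langle \tilde{\sigma}^2_y \right\rangle}{n + \tilde{h}},$$

$$\text{LOESS:} \quad \left\langle \tilde{\sigma}^2_{\hat{y}} \right\rangle = \frac{\left\langle \tilde{\sigma}^2_{y|x} \right\rangle}{n + \tilde{h}} \left( 1 + \frac{(x - \tilde{\mu}_x)^2}{\left\langle \tilde{\sigma}^2_x \right\rangle} \right),$$

where the expectation can be computed exactly in closed form:

$$\tilde{\mu}_x = \frac{n \mu_x + \tilde{h} \tilde{x}}{n + \tilde{h}},$$

$$\left\langle \tilde{\sigma}^2_x \right\rangle = \frac{n \sigma^2_x}{n + \tilde{h}} + \frac{n \tilde{h} (\tilde{x} - \mu_x)^2}{(n + \tilde{h})^2},$$

$$\left\langle \tilde{\sigma}^2_y \right\rangle = \frac{n \sigma^2_y + \tilde{h} \sigma^2_{y|x}(\tilde{x})}{n + \tilde{h}} + \frac{n \tilde{h} (\hat{y}(\tilde{x}) - \mu_y)^2}{(n + \tilde{h})^2},$$

$$\left\langle \tilde{\sigma}_{xy} \right\rangle = \frac{n \sigma_{xy}}{n + \tilde{h}} + \frac{n \tilde{h} (\tilde{x} - \mu_x)(\hat{y}(\tilde{x}) - \mu_y)}{(n + \tilde{h})^2},$$

$$\left\langle \tilde{\sigma}^2_{xy} \right\rangle = \left\langle \tilde{\sigma}_{xy} \right\rangle^2 + \frac{n^2 \tilde{h}^2 \sigma^2_{y|x}(\tilde{x}) (\tilde{x} - \mu_x)^2}{(n + \tilde{h})^4},$$

$$\left\langle \tilde{\sigma}^2_{y|x} \right\rangle = \left\langle \tilde{\sigma}^2_y \right\rangle - \frac{\left\langle \tilde{\sigma}^2_{xy} \right\rangle}{\left\langle \tilde{\sigma}^2_x \right\rangle}.$$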

5  EXPERIMENTAL  RESULTS 


Below we describe two sets of experiments demonstrating the
predictive power of the query selection criteria in this paper.
In the first set, learners were trained on data from a noisy
sine wave. The criteria described in this paper were applied to
predict how a new training example selected at point
$\tilde{x}$ would decrease the learner’s variance. These
predictions, along with the actual changes in variance when the
training points were queried and added, are plotted in Figures
1, 2, 3, and 4.

In the second set of experiments, we applied the techniques of
this paper to learning the kinematics of a two-joint planar arm
(Figure 5; see Cohn [1994] for details). Below, we illustrate
the problem using the LOESS algorithm.

An example of the correlation between predicted and actual
changes in variance on this problem is plotted in Figure 6.
Figures 7 and 8 demonstrate that this correlation may be
exploited to guide sequential query selection. We compared a
LOESS learner which selected each new query so as to minimize
expected variance with LOESS learners which selected queries
according to various heuristics. The variance-minimizing
learner significantly outperforms the heuristics in terms of
both variance and MSE.


6  SUMMARY 

Mixtures of Gaussians and locally weighted regression are two
statistical models that offer elegant representations and
efficient learning algorithms. In this paper we have shown that
they also offer the opportunity to perform active learning in
an efficient and statistically correct manner. The criteria
derived here can be computed cheaply and, for problems tested,
demonstrate good predictive power.


Figure 1: The upper portion of the plot indicates the neural
network’s fit to noisy sinusoidal data. The lower portion of
the plot indicates predicted and actual changes in the
network’s average estimated variance when $\tilde{x}$ is
queried and added to the training set, for
$\tilde{x} \in [0, 1]$. Changes are not plotted to scale with
fits.


Figure 2: Fit to data and correlation for a mixture of
Gaussians.

Figure 7: Variance for a LOESS learner selecting queries
according to the variance-minimizing criterion discussed in
this paper and according to several heuristics. “Sensitivity”
queries where output is most sensitive to new data, “Bias”
queries according to a bias-minimizing criterion, “Support”
queries where the model has the least data support. The
variances of “Random” and “Sensitivity” are off the scale.
Curves are medians over 15 runs with non-Gaussian noise.

Figure 8: MSE for a LOESS learner selecting queries
according to the variance-minimizing criterion discussed
in this paper and according to the heuristics described
in the previous figure.